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We present a new algorithm for approximate inference in probabilistic programs, 
based on a stochastic gradient for variational programs. This method is efficient 
without restrictions on the probabilistic program; it is particularly practical for 
distributions which are not analytically tractable, including highly structured dis- 
tributions that arise in probabilistic programs. We show how to automatically 
derive mean-field probabilistic programs and optimize them, and demonstrate that 
our perspective improves inference efficiency over other algorithms. 
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1 Introduction 



1.1 Automated Variational Inference for Probabilistic Programming 

ON 



Probabilistic programming languages simplify the development of probabilistic models by allowing 
programmers to specify a stochastic process using syntax that resembles modern programming lan- 
guages. These languages allow programmers to freely mix deterministic and stochastic elements, 
resulting in tremendous modeling flexibility. The resulting programs define prior distributions: run- 
ning the (unconditional) program forward many times results in a distribution over execution traces, 
with each trace generating a sample of data from the prior. The goal of inference in such programs is 
to reason about the posterior distribution over execution traces conditioned on a particular program 
output. Examples include BLOG [1], PRISM [2], Bayesian Logic Programs OL Stochastic Logic 
Programs 0, Independent Choice Logic &, IBAL 0, Probabilistic Scheme Q], A d], Church 
H, Stochastic Matlab d, and HANSEI pill . 

It is easy to sample from the prior p(x) defined by a probabilistic program: simply run the program. 
But inference in such languages is hard: given a known value of a subset y of the variables, inference 
must essentially run the program 'backwards' to sample from p(x\y). Probabilistic programming 
environments simplify inference by providing universal inference algorithms; these are usually sam- 
ple based (MCMC or Gibbs) jHlalMl^ due to their universality and ease of implementation. 



Variational inference B 1 21 - 4 1 411 offers a powerful, deterministic approximation to exact Bayesian in- 
ference in complex distributions. The goal is to approximate a complex distribution p by a simpler 
parametric distribution qg; inference therefore becomes the task of finding the qg closest to p, as mea- 
sured by KL divergence. If qg is an easy distribution, this optimization can often be done tractably; 
for example, the mean-field approximation assumes that qg is a product of marginal distributions, 
which is easy to work with. 

Since variational inference techniques offer a compelling alternative to sample based methods, it 
is of interest to automatically derive them, especially for complex models. Unfortunately, this is 
intractable for most programs. Even for models which have closed-form coordinate descent equa- 
tions, the derivations are often complex and cannot be done by a computer. However, in this paper, 
we show that it is tractable to construct a stochastic, gradient-based variational inference algorithm 
automatically by leveraging compositionality. 
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2 Automated Variational Inference 



An unconditional probabilistic program / is defined as a parameterless function with an arbitrary 
mix of deterministic and stochastic elements. Stochastic elements can either belong to a fixed set of 
known, atomic random procedures called ERPs (for elementary random procedures), or be defined 
as function of other stochastic elements. 

The syntax and evaluation of a valid program, as well as the definition of the library of ERPs, define 
the probabilistic programming language. As the program / runs, it will encounter a sequence of 
ERPs x\, ■ ■ • ,xt, and sample values for each. The set of sampled values is called the trace of 
the program. Let xt be the value taken by the < th ERP. The probability of a trace is given by the 
probability of each ERP taking the particular value observed in the trace: 

T 

p(x) = l[p t (x t \Mht)) (1) 

t=i 

where h t is the history (xi, . . . , Xt-i) of the program up to ERP t, p t is the probability distribution of 
the t th ERP, with history-dependent parameter ip t (ht)- Trace-based probabilistic programs therefore 
define directed graphical models, but in a more general way than many classes of models, since 
the language can allow complex programming concepts such as flow control, recursion, external 
libraries, data structures, etc. 



2.1 KL Divergence 

The goal of variational inference lfT2T - [l4ll is to approximate the complex distribution p(x\y) with a 
simpler distribution pe(x). This is done by adjusting the parameters 9 of pg(x) in order to maximize 
a reward function L(9), typically given by the KL divergence: 

KL(p e ,p(x\y)) = ^p e ( x )log(^r 

Pe (x) log ( I + \ogp(y) = -L{9) + logp(y) (2) 



^p(y\x)p(x) 

where (3) 

L(0) ± [pe( X )lo g ( P -^m (4) 
Jx \ Pe{x) J 



Since the KL divergence is nonnegative, the reward function L(9) is a lower bound on the partition 
function log p(y) = log J p(y\x)p(x); the approximation error is therefore minimized by maximiz- 
ing the lower bound. 

Different choices of pe(x) result in different kinds of approximations. The popular mean-field ap- 
proximation decomposes pg (x) into a product of marginals as pg(x) — Ylt=i P8( x t\&t)> where every 
random choice ignores the history h t of the generative process. 



2.2 Stochastic Gradient Optimization 

Minimizing Eq. [2] is typically done by computing derivatives analytically, setting them equal to 
zero, solving for a coupled set of nonlinear equations, and deriving an iterative coordinate descent 
algorithm. However, this approach only works for conjugate distributions, and fails for highly struc- 
tured distributions (such as those represented by, for example, probabilistic programs) that are not 
analytically tractable. 
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One generic approach to solving this is (stochastic) gradient descent on L(9). We estimate the 
gradient according to the following computation: 



(6) 



Pe (x)Velog(p e (x)) (log ( M *\ )) (8) 
V \PKV\ X )PK X ) J J 

pe(x)\7 e log( Pe (x)) (log ( P f*\ ) + k) (9) 



with x- 7 ~ Pe{x), j = 1...N and K is an arbitrary constant. To obtain equations (7-9), we 



repeatedly use the fact that V logpe{x) — Vep ?ffl ■ Furthermore, for equations (|7) and (|9), we also 



use J x p e (x)\7logp g (x) = 0, since f x p e (x)X7 logp e (x) = J x V g pg(x) = Vg j x pe{x) = Vgl = 
0. The purpose of adding the constant K is that it is possible to approximately estimate a value 
of K (optimal baseline), such that the variance of the Monte-Carlo estimate ( fTOb of expression (0 
is minimized. As we will see, choosing an appropriate value of K will have drastic effects on the 
quality of the gradient estimate. 

2.3 Compositional Variational Inference 

Consider a distribution p(x) induced by an arbitrary, unconditional probabilistic program. Our goal 
is to estimate marginals of the conditional distribution p(x\y), which we will call the target pro- 
gram. We introduce a variational distribution p$(x), which is defined through another probabilistic 
program, called the variational program. This distribution is unconditional, so sampling from it is 
as easy as running it. 

We derive the variational program from the target program. An easy way to do this is to use a 
partial mean-field approximation: the target probabilistic program is run forward, and each time 
an ERP xt is encountered, a variational parameter is used in place of whatever parameters would 
ordinarily be passed to the ERP. That is, instead of sampling Xt from ptixt \ ipt(ht)) as in Eq.Q] we 
instead sample from p t (xt \ t (ht)), where 9 t (h t ) is an auxiliary variational parameter (and the true 
parameter ipt{ht) is ignored). 

Fig.rjjillustrates this with pseudocode for a probabilistic program and its variational equivalent: upon 
encountering the normal ERP on line 4, instead of using parameter mu, the variational parameter 
03 is used instead (normal is a Gaussian ERP which takes an optional argument for the mean, 
and rand (a, b) is uniform over the set [a, b], with [0, 1] as the default argument). Note that a 
dependency between X and M exists through the control logic, but not the parameterization. Thus, 
in general, stochastic dependencies due to the parameters of a variable depending on the outcome 
of another variable disappear, but dependencies due to control logic remain (hence the term partial 
mean-field approximation). 

This idea can be extended to automatically compute the stochastic gradient of the variational dis- 
tribution: we run the forward target program normally, and whenever a call to an ERP x t is made, 
we: 

• Sample x t according to pg t (xt) (if this is the first time the ERP is encountered, initialize t to an 
arbitrary value, for instance that given by ipt(hf)). 

• Compute the log-likelihood logpe t (x t ) of x t according to the mean-field distribution. 

• Compute the log-likelihood logp(x t \h t ) of Xt according to the target program. 

• Compute the reward R t — logp{x t \h t ) — logpe{x t ) 
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Probabilistic program A 



1: M = normal ( ) ; 

2: if M>1 

3: mu = complex_deterministic_f unc ( M ); 
4: X = normal ( mu ) ; 

5: else 

6: X = rand ( ) ; 

7: end; 



Mean-Field variational program A 

1: M = normal ( Q\ ) ; 

2: if M>1 

3: mu = complex_deterministic_f unc ( M ); 
4: X = normal ( 9 3 ) ; 

5: else 

6: X = rand (6» 4 ,6> 5 ) ; 

7: end; 

Figure 1: A probabilistic program and corresponding variational program 



• Compute the local gradient ip t — V# t \ogpg t (xt). 

When the program terminates, we simply compute log p(y\x), then compute the gain i? = i? t + 
\ogp(y\x) + K. The gradient estimate for the t th ERP is given by Ri/j t , and can be averaged over 
many sample traces x for a more accurate estimate. 

Thus, the only requirement on the probabilistic program is to be able to compute the log likelihood 
of an ERP value, as well as its gradient with respect to its parameters. Let us highlight that being 
able to compute the gradient of the log-likelihood with respect to natural parameters is the only 
additional requirement compared to an MCMC sampler. 

Note that everything above holds: 1) regardless of conjugacy of distributions in the stochastic pro- 
gram; 2) regardless of the control logic of the stochastic program; and 3) regardless of the actual 
parametrization of p(xt; 9t). In particular, we again emphasize that we do not need the gradients of 
deterministic structures (for example, the function complex_deterministic_f unc in Fig.[TJ- 

2.4 Extensions 

Here, we discuss three extensions of our core ideas. 

Learning inference transfer. Assume we wish to run variational inference for N distinct datasets 
y 1 , . . . , y N . Ideally, one should solve a distinct inference problem for each, yielding distinct 
9 1 , ■ ■ ■ , 9 N . Unfortunately, finding 9 , • • • ,9 does not help to find 9 N+1 for a new dataset yjv+i- 
But perhaps our approach can be used to learn 'approximate samplers': instead of 6 depending on y 
implicitly via the optimization algorithm, suppose instead that t depends on y through some fixed 
functional form. For instance, we can assume 8 t (y) = J2j a i,j fj iv)< wnere fj is a known function, 
then find parameters 9 t j such that for most observations y, the variational distribution pe( v )(x) is a 
decent approximate sampler to p(x\y). Gradient estimates of a can be derived similarly to Eq. |2]for 
arbitrary probabilistic programs. 

Structured mean-field approximations. It is sometimes the case that a vanilla mean-field distribu- 
tion is a poo r approximation of the posterior, in which case more structured approximations should 
be used fll5l4l8ll . Deriving the variational update for structured mean-field is harder than vanilla 
mean-field; however, from a probabilistic program point of view, a structured mean-field approxi- 
mation is simply a more complex (but still unconditional) variational program that could be derived 
via program analysis (or perhaps online via RL state-space estimation), with gradients computed as 
in the mean-field case. 

Online probabilistic programming One advantage of stochastic gradients in probabilistic pro- 
grams is simple parallelizability. This can also be done in an online fashion, in a similar fashion 
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to recent work for stochastic variational inference by Blei et al. lfl9l - l2lll . Suppose that the set of 
variables and observations can be separated into a main set X, and a large number N of independent 
sets of latent variables Xi and observations Yi (where the (Xi, Yi) are only allowed to depend on 
X). For instance, for LDA, X represents the topic distributions, while the represent the docu- 
ment distribution over topics and Yi topic i. Recall the gradient for the variational parameters of 
X is given by Kipx, with K — Rx + Ylii^i + ^ogP(Yi\Xi, X), where Rx is the sum of re- 
wards for all ERPs in X, and Ri is the sum of rewards for all ERPs in Xi. K can be rewritten as 
X + NE[R V + log P(Y V \X V , X), where v is a random integer in {1, ... , N}. The expectation can 
be approximately computed in an online fashion, allowing the update of the estimate of X without 
manipulating the entire data set Y. 

3 Experiments: LDA and QMR 



QMR 

450 r , 
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(a) Results on QMR (b) Results on LDA (c) Steepest descent vs. conjugate 

gradients 

Figure 2: Numerical simulations of AVI 

We tested automated variational inference on two common inference benchmarks: the QMR-DT 
network (a binary bipartite graphical model with noisy-or directed links) and LDA (a popular topic 
model). We compared three algorithms: 

• The first is vanilla stochastic gradient descent on Eq. [2] with the gradients given by Eq. [5] 

• The Episodic Natural Actor Critic algorithm, a version of the algorithm connecting variational 
inference to reinforcement learning - details are reserved for a longer version of this paper. An 
important feature of ENAC is optimizing over the baseline constant K. 

• A second-order gradient descent (SOGD) algorithm which estimates the Fisher information ma- 
trix Fq in the same way as the ENAC algorithm, and uses it as curvature information. 

For each algorithm, we set M = 10 (i.e., far fewer roll-outs than parameters). All three algorithms 
were given the same "budget" of samples; they used them in different ways. All three algorithms 
estimated a gradient <?(#); these were used in a steepest descent optimizer: 6 = 9 + ag{9) with 
stepsize a. All three algorithms used the same stepsize; in addition, the gradients g were scaled to 
have unit norm. The experiment thus directly compares the quality of the direction of the gradient 
estimate. 

Fig. [3] shows the results. The ENAC algorithm shows faster convergence and lower variance than 
steepest descent, while SOGD fares poorly (and even diverges in the case of LDA). Fig. [3] also 
shows that the gradients from ENAC can be used either with steepest descent or a conjugate gradient 
optimizer; conjugate gradients converge faster. 

Because both SOGD and ENAC estimate Fg in the same way, we conclude that the performance 
advantage of ENAC is not due solely to its use of second-order information: the additional step of 
estimating the baseline improves performance significantly. 
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Once converged, the estimated variational program allows very fast approximate sampling from the 
posterior, at a fraction of the cost of a sample obtained using MCMC sampling. Samples from the 
variational program can also be used as warm starts for MCMC sampling. 

4 Related Work 

Natural conjugate gradients for variational inference are investigated in [1250, but the analysis is 
mostly devoted to the case where the variational approximation is Gaussian, and the resulting gradi- 
ent equation involves an integral which is not necessarily tractable. 

The use of variational inference in probabilistic programming is explored in l29ll . The authors 
similarly note that it is easy to sample from the variational program. However, they only use this 
observation to estimate the free energy of the variational program, but they do not estimate the 
gradient of that free energy. While they do highlight the need for optimizing the parameters of the 
variational program, they do not offer a general algorithm for doing so, instead suggesting rejection 
sampling or importance sampling. 

Use of stochastic approximations for variational inference is also used by Carbonetto [30]. Their 
approach is very different from ours: they use Sequential Monte Carlo to refine gradient estimates, 
and require that the family of variational distributions contains the target distribution. While their 
approach is fairly general, it cannot be automatically generated for arbitrarily complex probabilistic 
models. 

Finally, stochastic gradient methods are also used in online variational inference algorithms, in par- 
ticular in the work of Blei et al. in stochastic variational inference (for instance, online LDA lfl9ll . on- 
line HDP |[20j], and more generally under conjugacy assumptions Il2ll0 . as a way to refine estimates 
of latent variable distributions without processing all the observations. However, this approach re- 
quires a manual derivation of the variational equation for coordinate descent, which is only possible 
under conjugacy assumptions which will in general not hold for arbitrary probabilistic programs. 
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